A novel interpretable deep transfer learning combining diverse learnable parameters for improved T2D prediction based on single-cell gene regulatory networks

Accurate deep learning (DL) models for predicting type 2 diabetes (T2D) must not only perform well on the discrimination task but also learn useful feature representations. However, existing DL tools are far from perfect and do not provide an interpretation that explains, and guides the way to, superior performance in the target task. We therefore provide an interpretable approach for our deep transfer learning (DTL) models that overcomes these drawbacks and works as follows. We utilize several pre-trained models, including SEResNet152 and SEResNeXT101. We then transfer knowledge from the pre-trained models by keeping the weights in the convolutional base (i.e., the feature extraction part) while modifying the classification part, trained with the Adam optimizer, to classify healthy controls and T2D based on single-cell gene regulatory network (SCGRN) images. Other DTL models work in a similar manner but keep only the weights of the bottom layers in the feature extraction part unaltered while updating the weights of the subsequent layers through training from scratch. Experimental results on the whole set of 224 SCGRN images using five-fold cross-validation show that our model (TFeSEResNeXT101) achieves the highest average balanced accuracy (BAC) of 0.97, thereby significantly outperforming the baseline, which resulted in an average BAC of 0.86. Moreover, the simulation study demonstrates that the superiority is attributed to the distributional conformance of model weight parameters obtained with the Adam optimizer when coupled with weights from a pre-trained model.


Biological networks
We provide an illustration in Fig. 1 of the biological network images used in this study, which were downloaded from 19 and consist of 224 SCGRN images pertaining to healthy controls and T2D patients. The class distribution of these biological network images is balanced (i.e., the 224 images are divided evenly between the two classes). The images were produced using the bigSCale package to process the single-cell gene expression data and build regulatory networks, followed by visualizing the networks via the NetBioV package. The single-cell gene expression data pertaining to healthy controls and T2D patients were obtained from the ArrayExpress repository under accession number E-MTAB-5061 26.

Deep transfer learning
Figure 1 demonstrates how our deep transfer learning (DTL) approach is performed. First, we adapt the following pre-trained models: VGG19, DenseNet201 25, InceptionV3 27, ResNet50V2 28, ResNet101V2 28, SEResNet152 29, and SEResNeXT101 29. Each pre-trained model has a feature extraction part (i.e., a series of convolutional and pooling layers) and a densely connected classifier for classification. Then, we keep the weights of the feature extraction part of a pre-trained model unchanged and change the densely connected classifier to handle binary classification instead of 1000 classes. Therefore, when feeding the SCGRN image dataset, we extract features using the weights of pre-trained models while training the densely connected classifier from scratch and performing prediction. We refer to models using this type of DTL computation as TFeVGG19, TFeDenseNet201, TFeInceptionV3, TFeResNet50V2, TFeResNet101V2, TFeSEResNet152, and TFeSEResNeXT101 (see Fig. 1). For the other DTL computations, we keep the weights of the bottom layers in the feature extraction part unchanged while training from scratch to update the weights of the top layers in the feature extraction part and of the densely connected layers.
As in TFe-based models, we modify the densely connected classifier to handle the binary classification problem before performing the training phase. As seen in Fig. 2, we refer to models employing this type of deep transfer learning as TFtVGG19, TFtDenseNet201, TFtInceptionV3, TFtResNet50V2, TFtResNet101V2, TFtSEResNet152, and TFtSEResNeXT101.
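The difference between the two transfer schemes can be sketched without any deep learning framework. The snippet below is a minimal illustration, assuming a hypothetical model described only by an ordered list of layers. The names `Layer`, `freeze_for_tfe`, and `freeze_for_tft`, the toy layer list, and the number of frozen bottom layers are assumptions made for illustration and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    part: str              # "features" (convolutional base) or "classifier"
    trainable: bool = True


def build_toy_model():
    # Hypothetical stand-in for a pre-trained network: four feature-extraction
    # layers (bottom to top) followed by a new densely connected classifier.
    features = [Layer(f"conv_block{i}", "features") for i in range(1, 5)]
    classifier = [Layer("dense", "classifier"), Layer("output", "classifier")]
    return features + classifier


def freeze_for_tfe(layers):
    # TFe: keep every feature-extraction weight fixed; train only the classifier.
    for layer in layers:
        layer.trainable = layer.part == "classifier"
    return layers


def freeze_for_tft(layers, n_frozen_bottom):
    # TFt: keep only the bottom feature-extraction layers fixed; train the
    # remaining top feature layers together with the classifier.
    seen_features = 0
    for layer in layers:
        if layer.part == "features":
            seen_features += 1
            layer.trainable = seen_features > n_frozen_bottom
        else:
            layer.trainable = True
    return layers


def trainable_names(layers):
    return [layer.name for layer in layers if layer.trainable]


tfe_trainable = trainable_names(freeze_for_tfe(build_toy_model()))
tft_trainable = trainable_names(freeze_for_tft(build_toy_model(), n_frozen_bottom=2))
print(tfe_trainable)
print(tft_trainable)
```

In a real framework the same bookkeeping is usually expressed by toggling a per-layer trainable flag on the pre-trained network before compiling the model.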
When changing weights during training, we employed three optimizers: Adam, RMSprop, and SGD 30. Weights that are kept unchanged correspond to the knowledge transferred from pre-trained models, which was obtained using the SGD optimizer. For unseen SCGRN images, predictions are mapped to healthy control subjects if the predicted values are greater than 0.5. Otherwise, predictions are mapped to T2D.
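The decision rule for unseen images can be written in a few lines. This is a minimal sketch; the function name `map_prediction`, the label strings, and the example scores are illustrative assumptions.

```python
def map_prediction(score, threshold=0.5):
    """Map a predicted value to a class label using the 0.5 cut-off."""
    # Strictly greater than the threshold means healthy control; otherwise T2D.
    return "healthy" if score > threshold else "T2D"


scores = [0.91, 0.50, 0.12]   # hypothetical model outputs
labels = [map_prediction(s) for s in scores]
print(labels)
```

Note that a predicted value of exactly 0.5 is mapped to T2D, because the rule requires a value greater than 0.5 for the healthy control class.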

Classification methodology
In this study, we considered seven pre-trained models, namely VGG19, DenseNet201, InceptionV3, ResNet50V2, ResNet101V2, SEResNet152, and SEResNeXT101. Each of the pre-trained models was trained on 1.28 million images from the ImageNet database to classify images into 1000 different categories. For the TFe-based models, we used the feature extraction part of the pre-trained models, in which the weights were kept unchanged, to extract features from SCGRN images. Moreover, the densely connected classifier was trained from scratch to handle the binary classification problem. For the TFt-based models, we trained the top layers and the densely connected classifier from scratch while keeping the weights of the bottom layers in the feature extraction part unchanged. For both TFt-based and TFe-based models, we employed the Adam optimizer when updating the weights of layers. Moreover, we compared the performance of our deep transfer learning approaches under different optimizers, including the baseline (i.e., the RMSprop optimizer), as well as against models trained from scratch. We set the optimization parameters as follows: 0.00001 for the learning rate, 10 for the number of epochs, and 32 for the batch size. For the loss function, we utilized categorical cross-entropy 31.
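To make the loss concrete, the following sketch computes categorical cross-entropy for a small batch of one-hot targets and records the stated optimization parameters. The dictionary `TRAINING_CONFIG`, the function name, the class order, and the example probabilities are assumptions for illustration only.

```python
import math

# Optimization parameters as stated in the text.
TRAINING_CONFIG = {"learning_rate": 0.00001, "epochs": 10, "batch_size": 32}


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch of one-hot targets."""
    total = 0.0
    for target, probs in zip(y_true, y_pred):
        # Only the probability assigned to the true class contributes.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_true)


# Hypothetical two-class batch; the class order [T2D, healthy] is an assumption.
y_true = [[0, 1], [1, 0]]
y_pred = [[0.2, 0.8], [0.6, 0.4]]
loss = categorical_cross_entropy(y_true, y_pred)
print(loss)
```

The clipping constant `eps` only guards against taking the logarithm of zero and does not change the loss for well-behaved probabilities.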
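To assess the performance of the studied models, we employed Balanced Accuracy (BAC), Accuracy (ACC), Precision (PRE), Recall (REC), and F1, computed as follows:

$$\mathrm{BAC} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \tag{1}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$\mathrm{PRE} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{REC} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{PRE} \times \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}} \tag{5}$$

where TN designates true negative, corresponding to the number of T2D images that were correctly predicted as T2D. FP designates false positive, corresponding to the number of T2D images that were incorrectly predicted as healthy controls. TP designates true positive, corresponding to the number of healthy control images that were correctly predicted as healthy controls. FN designates false negative, corresponding to the number of healthy control images that were incorrectly predicted as T2D.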
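The five measures can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions with healthy control as the positive class; the function name and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute BAC, ACC, PRE, REC and F1 from confusion-matrix counts.

    Healthy control is the positive class and T2D is the negative class.
    """
    rec = tp / (tp + fn)                    # recall (true positive rate)
    spec = tn / (tn + fp)                   # true negative rate
    pre = tp / (tp + fp)                    # precision
    acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    bac = 0.5 * (rec + spec)                # balanced accuracy
    f1 = 2 * pre * rec / (pre + rec)
    return {"BAC": bac, "ACC": acc, "PRE": pre, "REC": rec, "F1": f1}


# Hypothetical counts for one test fold of 45 images.
metrics = classification_metrics(tp=21, tn=20, fp=3, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty classes, so a fold with no positive or no negative predictions would raise a division-by-zero error.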
To evaluate the results on the whole SCGRN image dataset, we employed five-fold cross-validation as follows. We partitioned the SCGRN image dataset by randomly assigning images to 5 folds. During the first run of five-fold cross-validation, we used 4 of the folds to train our deep learning models, performed predictions on the remaining fold for testing, and recorded the performance results. This process was repeated for an additional 4 runs, in each of which the performance results were recorded. Finally, we report the average of the performance results obtained from the five runs.
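The partitioning described above can be sketched as follows. This is a minimal illustration that assigns image indices to folds at random; the function name `five_fold_splits` and the fixed seed are assumptions, and the sketch does not attempt to reproduce the exact folds used in the study.

```python
import random


def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each cross-validation run."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)                   # random assignment
    folds = [indices[k::n_folds] for k in range(n_folds)]  # near-equal folds
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


splits = list(five_fold_splits(224))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Each image appears in exactly one test fold, so averaging the five per-run results covers the whole dataset once.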

Implementation details
All experiments were run on the central processing unit (CPU) runtime of Google Colab. The CPU runtime offered by Google Colab provided an Intel Xeon processor with two cores at 2.30 GHz and 13 GB of RAM, and the installed version of Python was 3.10.11. For the analysis of models, we used R statistical software 32 to run the experiments and utilized the optimg package in R to run the Adam optimizer 33. All plots were produced using the Matplotlib package in Python 34.

Training results
In Fig. 3, we illustrate the training accuracy results when running five-fold cross-validation. It can be seen that our models outperformed the models trained from scratch in most cases. Specifically, TFeVGG19 and TFtVGG19 achieved average accuracies of 0.976 and 0.962, respectively, while VGG19 achieved an average accuracy of 0.530. TFeDenseNet201 outperformed DenseNet201 by achieving an average accuracy of 0.988, while DenseNet201 performed better than TFtDenseNet201, achieving an average accuracy of 0.982 compared to 0.946. The TFe- and TFt-based models coupled with ResNet101V2, SEResNet152, and SEResNeXT101 outperformed their counterparts that did not apply deep transfer learning (DTL). These superior performance results are attributed to the representation learned using transfer learning.

Testing results
Figures 4 and 5 report the generalization (i.e., test) accuracy results and the combined confusion matrices, respectively, when five-fold cross-validation is utilized. TFeSEResNeXT101 achieved the highest average accuracy of 0.968.
The second-best model is TFeDenseNet201, achieving an average accuracy of 0.958, followed by TFeVGG19, TFeResNet50V2, TFeSEResNet152, TFeInceptionV3, and TFeResNet101V2 (with average accuracies of 0.946, 0.940, 0.936, 0.930, and 0.918, respectively). TFt-based models also outperformed all models trained from scratch (see Fig. 4b,c). In particular, the average accuracies of the TFt-based models ranged from 0.864 to 0.916, while those of the models trained from scratch ranged from 0.468 to 0.590. These results demonstrate the superior performance of models employing our DTL computations.
When testing performance is reported using different metrics, our model TFeSEResNeXT101 outperforms all other models (see Table 1) by achieving an average BAC of 0.97, an average PRE of 0.97 (tied with our model TFeSEResNet152), and an average F1 of 0.97. Moreover, TFeVGG19 and TFtVGG19 perform better than VGG19. Similarly, TFeDenseNet201, TFeInceptionV3, TFeResNet50V2, TFeResNet101V2, and TFeSEResNet152 performed better than DenseNet201, InceptionV3, ResNet50V2, ResNet101V2, and SEResNet152, respectively. The same holds true for TFt-based models, which outperformed their counterparts (i.e., VGG19, DenseNet201, InceptionV3, ResNet50V2, ResNet101V2, SEResNet152, and SEResNeXT101). In Table 3, we compare our model TFeVGG19 with the Adam optimizer against the best performing baseline, TFeVGG19 with the RMSprop optimizer, named VGG19 in 19. Our model TFeVGG19 with the Adam optimizer achieves the highest average BAC of 0.94, while the baseline obtained an average BAC of 0.86. Moreover, when the F1 performance measure is considered, TFeVGG19 with the Adam optimizer attains the highest average F1 of 0.94, while the baseline achieved an average F1 of 0.88. TFtVGG19 also outperformed the baseline, achieving an average BAC of 0.91 and an average F1 of 0.90.
In Fig. 6, we report the running time in seconds of five-fold cross-validation when utilizing our best model (TFeSEResNeXT101) and TFtSEResNeXT101 compared to their peer SEResNeXT101.

Table 1. Reported average performance results on the testing folds during five-fold cross-validation using the studied models. BAC is balanced accuracy. PRE is precision. REC is recall. The best overall result is underlined and shown in bold. A method outperforming its counterparts is only underlined.

Columns: Model, Optimizer, BAC, PRE, REC, F1

Stochastic gradient descent (SGD)
To minimize the objective function Q(θ_0, θ_1) for parameters θ_0 and θ_1 of model H(x_i), we employ gradient descent optimization algorithms to find the θ_0 and θ_1 that minimize the objective function. The optimization problem can be formulated as follows:

$$(\theta_0^{*}, \theta_1^{*}) = \arg\min_{\theta_0, \theta_1} Q(\theta_0, \theta_1) \tag{6}$$

We utilize the SGD, RMSprop, and Adam optimization algorithms to minimize the objective function and estimate the model parameters. For SGD, we initialize the parameters θ_0 and θ_1 according to the uniform distribution U(0, 1) and set the learning rate to η = 0.001 and the maximum number of iterations to 3000. Then, in each iteration, we shuffle the data of m examples and loop m times over the SGD update of the model parameters (Eq. 7). After the end of looping, the algorithm stops if the maximum number of iterations is reached or ‖∇Q(θ_0, θ_1)‖ ≤ 0.001.
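The SGD procedure described above can be sketched in plain Python. The form of the objective is an assumption here: the sketch takes Q to be half the mean squared error of the linear model H(x) = θ_0 + θ_1 x, which may differ from the objective used in the study. The function name `sgd_fit`, the seeds, and the noise-free toy data are likewise illustrative.

```python
import math
import random


def sgd_fit(xs, ys, eta=0.001, max_iters=3000, tol=0.001, seed=0):
    """Fit H(x) = theta0 + theta1 * x with per-example SGD.

    Assumption: Q(theta0, theta1) = (1 / (2m)) * sum((H(x_i) - y_i) ** 2).
    """
    rng = random.Random(seed)
    theta0, theta1 = rng.uniform(0, 1), rng.uniform(0, 1)  # U(0, 1) start
    m = len(xs)
    order = list(range(m))
    for _ in range(max_iters):
        rng.shuffle(order)                 # shuffle the m examples
        for i in order:                    # m single-example updates
            err = theta0 + theta1 * xs[i] - ys[i]
            theta0 -= eta * err
            theta1 -= eta * err * xs[i]
        # Full-data gradient of Q, used only for the stopping test.
        g0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
        g1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        if math.hypot(g0, g1) <= tol:
            break
    return theta0, theta1


data_rng = random.Random(1)
xs = [data_rng.uniform(0, 1) for _ in range(100)]
ys = [1.0 + 2.0 * x for x in xs]           # noise-free toy target
theta0, theta1 = sgd_fit(xs, ys)
print(theta0, theta1)
```

On this toy target the estimates should approach θ_0 = 1 and θ_1 = 2, limited by the gradient-norm tolerance at which the loop stops.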

Simulated data
To demonstrate the efficiency of the proposed deep transfer learning (DTL) models incorporating mixed parameters derived from both SGD and Adam, we conducted simulation studies to explain the superiority of the proposed models as well as to imitate their numerical behavior. In particular, we consider four predictive models, F_1 to F_4 (Eqs. 10 to 13), where X ∼ U(0, 1) and ϵ ∼ N(0, 0.2), in which U() and N() are the uniform and normal distributions, respectively. For F_1, given X and Y = F_1(X), we perform the following steps. Let H_SGD(x_i) = θ_0 + θ_1 x_i (for i = 1..m) be the model whose parameters we want to estimate using the (X, Y) data from F_1 coupled with Eq. (7). Similarly, let H_RMSprop(x_i) and H_Adam(x_i) be the models whose parameters we want to estimate using Eqs. (8) and (9), respectively, coupled with the (X, Y) data from F_1. Then, we provide each x_i ∈ X to perform predictions corresponding to y'_i. In Fig. 7, we report 2D plots of X against the predicted Y' = {y'_1, ..., y'_m} for each model using data generated according to Eq. (10), where SGD refers to plotting (x_i, H_SGD(x_i)) while Adam and RMSprop refer to plotting (x_i, H_Adam(x_i)) and (x_i, H_RMSprop(x_i)), respectively, for i = 1..m. We then repeat this process for an additional 8 runs, giving 9 runs in total.

Figure 7. Plots for the three models as (x_i, H_SGD(x_i)), (x_i, H_Adam(x_i)), and (x_i, H_RMSprop(x_i)) for i = 1..m according to x_i generated using F_1.

It can be seen from Fig. 7 that the model induced via RMSprop has larger distributional differences than those obtained via SGD and Adam. To quantify the distributional differences between SGD and Adam against those between SGD and RMSprop, we compute the distances d_SA and d_SR (Eqs. 14 and 15), where d_SA measures the distance between the data associated with SGD and Adam. Similarly, d_SR measures the distance between the data associated with SGD and RMSprop. The lower the distance value, the smaller the distributional difference. Figure 11a plots d_SA and d_SR for the 9 runs. It can be seen that H_Adam(x_i) is closer to H_SGD(x_i) than H_RMSprop(x_i) is to H_SGD(x_i) in most runs. Moreover, the distributional differences are statistically significant (P-value = 7.28 × 10^−14 from t-test).
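The comparison of d_SA and d_SR can be illustrated with a small numerical sketch. The exact distance used in Eqs. (14) and (15) is not reproduced here, so the snippet assumes a Euclidean distance between prediction vectors; the fitted parameter values are hypothetical and chosen only to show the computation.

```python
import math


def predictions(theta0, theta1, xs):
    """Evaluate the linear model H(x) = theta0 + theta1 * x on every x."""
    return [theta0 + theta1 * x for x in xs]


def euclidean_distance(a, b):
    # Assumed distance; the study's Eqs. (14) and (15) may define it differently.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


xs = [i / 10 for i in range(11)]

# Hypothetical fitted parameters for the three optimizers.
h_sgd = predictions(1.00, 2.00, xs)
h_adam = predictions(1.02, 1.97, xs)
h_rmsprop = predictions(0.80, 2.40, xs)

d_sa = euclidean_distance(h_sgd, h_adam)      # SGD versus Adam
d_sr = euclidean_distance(h_sgd, h_rmsprop)   # SGD versus RMSprop
print(d_sa, d_sr, d_sa < d_sr)
```

Repeating this for each of the 9 runs would give two samples of distances, which is the kind of input a t-test would then compare.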
These results demonstrate conformance of the weight parameters of models utilizing the Adam and SGD optimizers. Figure 8 reports the 2D plots of the three induced models as (x_i, H_SGD(x_i)), (x_i, H_Adam(x_i)), and (x_i, H_RMSprop(x_i)) for i = 1..m using data generated according to Eq. (11) (i.e., F_2) 36. It can be clearly seen that the results obtained via SGD are distributionally closer to those of Adam than to those obtained with RMSprop. In Fig. 11b, we quantify the distributional differences using Eqs. (14) and (15). Adam is closer to SGD (shown as AdamSGD) than RMSprop is to SGD (shown as RMSpropSGD) over the 9 runs. The quantification of AdamSGD corresponds to d_SA, while that of RMSpropSGD corresponds to d_SR.

Figure 8. Plots for the three models as (x_i, H_SGD(x_i)), (x_i, H_Adam(x_i)), and (x_i, H_RMSprop(x_i)) for i = 1..m according to x_i generated using F_2.

In addition, the distributional differences between AdamSGD and RMSpropSGD are statistically significant (P-value = 7.01 × 10^−7 from t-test). Figures 9 and 10 report 2D plots of (x_i, H_SGD(x_i)), (x_i, H_Adam(x_i)), and (x_i, H_RMSprop(x_i)) for i = 1..m using data generated according to Eqs. (12) (F_3) and (13) (F_4) 37, where the models were induced with the SGD, Adam, and RMSprop optimizers. The alignment of Adam with SGD shows that Adam has a closer data representation to SGD than RMSprop does. When quantifying the distributional differences in Fig. 11c and d, it can be clearly seen that the differences between SGD and Adam (referred to as AdamSGD) are smaller than those between SGD and RMSprop (referred to as RMSpropSGD) over the 9 runs.
These quantified results for AdamSGD and RMSpropSGD correspond to d_SA and d_SR, respectively. Additionally, the distributional differences between AdamSGD and RMSpropSGD were statistically significant (P-value = 3.48 × 10^−12 from t-test when F_3 is used and P-value = 1.49 × 10^−3 from t-test when F_4 is used). These results demonstrate the stable performance when SGD is coupled with Adam.

Discussion
Our deep transfer learning (DTL) models work as follows. In the TFe-based models, the convolutional base (also called the feature extraction part) of the pre-trained model is left unchanged while the densely connected classifier is modified to handle the binary classification problem at hand. Therefore, we applied the feature extraction part of pre-trained models to the SCGRN images to extract features, followed by a flattening step, to train the densely connected classifier from scratch. Note that only the weights of the densely connected classifier are changed according to the Adam optimizer, while we transferred knowledge (i.e., weights) of the feature extraction part from pre-trained models. In the TFt-based models, we keep the weights of the bottom layers in the feature extraction part of pre-trained models unchanged while modifying the weights of the subsequent layers, including the densely connected classifier, according to the Adam optimizer. Moreover, the densely connected [...]

Figure 9. Plots for the three models as (x_i, H_SGD(x_i)), (x_i, H_Adam(x_i)), and (x_i, H_RMSprop(x_i)) for i = 1..m according to x_i generated using F_3.

[...] different model weight parameters, we performed a simulated study. In Figs. 7, 8, 9, 10 and 11, we showed that a model induced with the SGD optimizer is closer to a model induced with the Adam optimizer than to a model induced with the RMSprop optimizer. It is evident from the visualized results in our study that SGD and Adam had smaller distributional differences than SGD and RMSprop. This resembles the case of having two related datasets for SGD and Adam against unrelated datasets for SGD and RMSprop. As a result, the inferior performance of models utilizing RMSprop is attributed to the high distributional differences in model weight parameters.
It is worth noting that our DTL models keep the weights of many layers unchanged. Therefore, when we trained our models, fewer weights were updated than in models trained from scratch. It can be seen from Fig. 6 that our DTL models are fast, which suggests that they could be adopted in mobile applications. It can be noticed from Tables 2 and 3 that leveraging source task knowledge contributed to improved prediction performance when coupled with weight parameters updated in the target task using the Adam optimizer. On the other hand, the knowledge transferred from the source task contributed to degraded performance when coupled with weight parameters updated in the target task using the RMSprop and SGD optimizers. Also, when we assessed additional DL models such as ConvNeXtLarge and ConvNeXtTiny, the same behavior was maintained: leveraging source domain knowledge coupled with updated weights in the target task remained the best-performing approach (see Supplementary Tables S2 and S3).

Conclusions and future work
In this paper, we present and analyze deep transfer learning (DTL) models for the task of classifying 224 SCGRN images pertaining to healthy controls and T2D patients. First, we utilized seven pre-trained models (including SEResNet152 and SEResNeXT101) already trained on more than one million images from the ImageNet dataset. Then, we left the weights in the convolutional base (i.e., the feature extraction part) unchanged, thereby transferring knowledge from pre-trained models, while modifying the densely connected classifier with the use of the Adam optimizer to discriminate between healthy and T2D SCGRN images. The other presented DTL models work as follows.

Figure 1. Flowchart of the deep transfer learning-based approach for predicting T2D using SCGRNs. Biological Networks: To infer a single-cell gene regulatory network (SCGRN), gene expression data are provided to bigSCale (which performs clustering and differential expression analysis), changing the measured correlation between genes from expression values to Z-scores, followed by retaining significant correlations to guide the building of a regulatory network. A visualization is performed using NetBioV. Deep Transfer Learning: Transfer learning applying feature extraction with a new classifier (TFe) to distinguish between T2D and healthy control SCGRNs.

Figure 2. Transfer learning applying fine tuning with a new classifier (TFt) to distinguish between T2D and healthy control SCGRNs.

Figure 3. Boxplots presenting the average five-fold cross-validation results using the ACC measure for the training folds. (a) Deep transfer learning models using feature extraction (referred to with the prefix TFe). (b) Deep transfer learning models using fine tuning (referred to with the prefix TFt). (c) Deep learning models trained from scratch. ACC is accuracy.

Figure 4. Boxplots presenting the average five-fold cross-validation results using the ACC measure for the testing folds. (a) Deep transfer learning models using feature extraction (referred to with the prefix TFe). (b) Deep transfer learning models using fine tuning (referred to with the prefix TFt). (c) Deep learning models trained from scratch. ACC is accuracy.

Figure 5. Combined confusion matrices for all methods during the running of five-fold cross-validation.

Figure 6. Running time comparisons in seconds for selected models when running five-fold cross-validation.

Figure 11. Boxplots for the four studied models, F_1 to F_4, showing the distance distribution over nine runs for AdamSGD and RMSpropSGD. (a) Results for F_1. (b) Results for F_2. (c) Results for F_3. (d) Results for F_4.

Table 2. Performance comparison of our best deep transfer learning model under different optimizers during five-fold cross-validation. BAC is balanced accuracy. The best performance result is shown in bold.

Table 3. Performance comparison of our deep transfer learning model against recent baseline methods when five-fold cross-validation is employed. BAC is balanced accuracy. The best performance result is shown in bold.

Our model TFeSEResNeXT101 is 208.45× faster than SEResNeXT101. Also, our model TFtSEResNeXT101 is 3.82× faster than SEResNeXT101. Moreover, TFeVGG19 and TFtVGG19 are 802.67× and 2.53× faster, respectively, than VGG19. These results demonstrate the computational efficiency of the DTL models, in addition to the high performance they achieved.